
[draft] [fix] [client] Fix IO buffer overflow when resend msg after producer reconnect #21351


Open · wants to merge 1 commit into base: master

Conversation

poorbarcode (Contributor)

Motivation

If sendAsync is called with too many messages while the producer is reconnecting, then after the connection is re-established all of the messages cached in memory are flushed to the socket at once, and the IO buffer overflows.

Got exception java.lang.IllegalStateException: buffer queue length overflow: 2143977581 + 4195072
	at io.netty.channel.AbstractCoalescingBufferQueue.incrementReadableBytes(AbstractCoalescingBufferQueue.java:368)
	at io.netty.channel.AbstractCoalescingBufferQueue.add(AbstractCoalescingBufferQueue.java:100)
	at io.netty.channel.AbstractCoalescingBufferQueue.add(AbstractCoalescingBufferQueue.java:84)
	at io.netty.handler.ssl.SslHandler.write(SslHandler.java:758)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at org.apache.pulsar.common.protocol.ByteBufPair$CopyingEncoder.write(ByteBufPair.java:149)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at org.apache.pulsar.client.impl.ProducerImpl.recoverProcessOpSendMsgFrom(ProducerImpl.java:2283)
	at org.apache.pulsar.client.impl.ProducerImpl.lambda$resendMessages$17(ProducerImpl.java:1918)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:833)
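For context on the numbers in the exception: it is thrown by Netty's AbstractCoalescingBufferQueue, which tracks the queued readable bytes in an int, and the two operands in the message already sit just below Integer.MAX_VALUE. A quick standalone check (not part of the PR), using the values copied from the exception message:

    public class OverflowArithmetic {
        public static void main(String[] args) {
            long queued = 2_143_977_581L; // bytes already queued, from the exception message
            long next   = 4_195_072L;     // size of the next write, from the exception message
            System.out.println(queued + next);     // 2148172653
            System.out.println(Integer.MAX_VALUE); // 2147483647, so an int counter would wrap
        }
    }

Flushing everything buffered during the reconnect in one go therefore pushes the counter past what the queue can represent, which is exactly the condition the exception guards against.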

Modifications

  • To avoid IO buffer overflow, split the resend into multiple flushes.
  • TODO: add a test

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Oct 12, 2023
@poorbarcode poorbarcode added the type/bug label (The PR fixed a bug or issue reported a bug) Oct 12, 2023
@poorbarcode poorbarcode added this to the 3.2.0 milestone Oct 12, 2023
@poorbarcode poorbarcode changed the title from "[fix] [client] Fix IO buffer overflow when resend msg after producer reconnect" to "[draft] [fix] [client] Fix IO buffer overflow when resend msg after producer reconnect" Oct 12, 2023
@github-actions github-actions bot added the doc-not-needed label (Your PR changes do not impact docs) Oct 12, 2023
@@ -2284,12 +2285,18 @@ private void recoverProcessOpSendMsgFrom(ClientCnx cnx, MessageImpl from, long e
if (stripChecksum) {
    stripChecksum(op);
}
// To avoid IO buffer overflow, split to multi-flush.
if (messageBytesSizeInCache + op.cmd.readableBytes() < messageBytesSizeInCache) {
    cnx.ctx().flush();
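A note on the condition above: assuming messageBytesSizeInCache is an int that accumulates the bytes written since the last flush, the check relies on Java's two's-complement wrap-around, where adding two non-negative ints yields a result smaller than the first operand exactly when the true sum exceeds Integer.MAX_VALUE. A minimal standalone illustration (not part of the PR), reusing the values from the stack trace:

    public class WrapAroundCheck {
        public static void main(String[] args) {
            int cached = 2_143_977_581; // bytes written since the last flush
            int next   = 4_195_072;     // readable bytes of the next command
            // The sum wraps negative, so the comparison is true and a flush would be triggered.
            System.out.println(cached + next < cached); // true
        }
    }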
Member:

flush is an async operation, so this might not work as expected

poorbarcode (Contributor, Author):

Since write and flush are executed on the same thread, both operations keep their order of execution, so it will work as expected.

Member:

I mean it from a different angle. I guess that in this case the purpose of flushing is to ensure that buffers don't overflow. It feels like the solution in this PR won't fully address that, since all operations will be queued up in any case unless the logic that calls write also runs in the ctx loop (thread). Another possibility would be to wait for the flush to complete before continuing.
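One way to realize the "wait for flush completion" alternative without blocking the event loop is to chain the next chunk of resends off the ChannelFuture returned by writeAndFlush. This is only a rough sketch of that idea; ChunkedResender and its queue of pre-sliced chunks are hypothetical and not part of the PR:

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelFuture;
    import io.netty.channel.ChannelFutureListener;
    import io.netty.channel.ChannelHandlerContext;
    import java.util.Queue;

    final class ChunkedResender {
        private final Queue<ByteBuf> chunks; // hypothetical queue of pre-sliced resend chunks

        ChunkedResender(Queue<ByteBuf> chunks) {
            this.chunks = chunks;
        }

        void resendNext(ChannelHandlerContext ctx) {
            ByteBuf chunk = chunks.poll();
            if (chunk == null) {
                return; // everything has been resent
            }
            ChannelFuture flushed = ctx.writeAndFlush(chunk);
            flushed.addListener((ChannelFutureListener) future -> {
                if (future.isSuccess()) {
                    resendNext(ctx); // continue only after the previous flush has completed
                } else {
                    ctx.close();     // abort the resend on write failure
                }
            });
        }
    }

Since the listener runs on the channel's event loop, the writes stay ordered while only one chunk at a time sits in the outbound buffer.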

Member:

In Netty, there's the Channel.isWritable method and channelWritabilityChanged callback that help in keeping the queued writes bounded (see low/high watermark options). However, IIRC, Pulsar code base doesn't show examples of how write logic could be implemented to take advantage of this type of backpressure solution.
Optimally, the logic would add more writes only while the channel is writable and pause otherwise; adding writes should resume once the channel becomes writable again.
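For reference, here is a minimal sketch of the writability-based pattern described above (not the PR's code; the pending queue and its contents are hypothetical): writes are added only while Channel.isWritable() returns true, and channelWritabilityChanged resumes draining once the outbound buffer falls back below the low watermark.

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import java.util.Queue;

    final class WritabilityAwareResender extends ChannelInboundHandlerAdapter {
        private final Queue<ByteBuf> pending; // hypothetical queue of messages waiting to be resent

        WritabilityAwareResender(Queue<ByteBuf> pending) {
            this.pending = pending;
        }

        /** Write while the outbound buffer stays under the high watermark, then flush. */
        void drain(ChannelHandlerContext ctx) {
            while (ctx.channel().isWritable()) {
                ByteBuf next = pending.poll();
                if (next == null) {
                    break; // nothing left to resend
                }
                ctx.write(next);
            }
            ctx.flush();
        }

        @Override
        public void channelWritabilityChanged(ChannelHandlerContext ctx) {
            if (ctx.channel().isWritable()) {
                drain(ctx); // resume once the buffer drained below the low watermark
            }
            ctx.fireChannelWritabilityChanged();
        }
    }

The low/high thresholds come from the channel's WriteBufferWaterMark, configurable via ChannelOption.WRITE_BUFFER_WATER_MARK.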

@lhotari (Member) left a comment:

I don't think that flushing will help resolve the issue.


@merlimat (Contributor):

I'm still not sure how it gets to 2 GB of accumulated data. The client memory limit backpressure should have blocked sends long before reaching 2 GB.
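The client memory limit referred to here is the per-client cap set on the client builder; once producers hold that much pending message data, further sends are expected to block or fail instead of buffering without bound, which is why backpressure would normally kick in well before 2 GB. A minimal configuration sketch (the service URL and limit value are illustrative only):

    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.PulsarClientException;
    import org.apache.pulsar.client.api.SizeUnit;

    public class ClientMemoryLimitExample {
        public static void main(String[] args) throws PulsarClientException {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")   // illustrative URL
                    .memoryLimit(64, SizeUnit.MEGA_BYTES)    // cap on pending message memory for this client
                    .build();
            client.close();
        }
    }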

Labels
  • doc-not-needed: Your PR changes do not impact docs
  • release/2.10.6
  • release/2.11.4
  • release/3.0.12
  • type/bug: The PR fixed a bug or issue reported a bug
Projects

None yet

6 participants